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ABSTRACT 

Context. Cosmological measurements require the calculation of nontrivial quantities over large datasets. The next generation of survey 
telescopes (such as DES, PanSTARRS, and LSST) will yield measurements of billions of galaxies. The scale of these datasets, and 
the nature of the calculations involved, make cosmological calculations ideal models for implementation on graphics processing units 
(GPUs). 

Aims. We consider two cosmological calculations, the two-point angular correlation function and the aperture mass statistic, and aim 
to improve the calculation time by constructing code for calculating them on the GPU. 

Methods. Using CUDA, we implement the two algorithms on the GPU and compare the calculation speeds to comparable code run 
on the CPU. 

Results. We obtain a code speed-up of between 10 - 180x faster, compared to performing the same calculation on the CPU. The code 
has been made publicly available. 

Conclusions. GPUs are a useful tool for cosmological calculations, even for datasets the size of current surveys, allowing calculations 
to be made one or two orders of magnitude faster. 

Key words, cosmological calculations - aperture mass - angular correlation function - GPU - CUDA - scientific computation - 
cosmology 
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1. Introduction 

The use of graphics processing units (GPUs) in scientific com- 
puting has been steadily growing in fields as diverse as bioin- 
formatics, QCD lattice calculations and seismology. In as- 
tronomy, GPUs have proven useful in many different com- 
putationally intensive problems such as N-body simulations 
( [Tanikawa et al. (2012)| |Nitadori and Aareth (2012)) l and radio 
astronomy measurements ( Clark et al. (201 1) i. GPU techniques 
have succeeded in reducing compute times for these difficult cal- 
culations by up to a factor of 100, and in this work we show that 
a similar reduction can be achieved for the calculation of cosmo- 
logical quantities. 

The next generation of large-scale astronomical surveys 
(such as DES, PanSTARPvS, and LSST) will produce enor- 
mous amounts of data, with measurements of billions of stars 
and galaxies. The problems associated with processing such 
a large volume of image data and how to structure and ac- 
cess a database containing tens of petabytes of information 
has been studied and discussed at length in the literature (e.g. 
|Way (201 l)||Berriman (20TT)l|Brunner et al. (2001)) . In this pa- 
per, we address the challenges an astronomer will face attempt- 
ing to analyze this information, once it has been obtained from a 
central database. With such a large volume of data the statistical 
uncertainties on many cosmological quantities will be reduced 
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by orders of magnitude compared to present limits, but in order 
to take advantage of this data new computation methods must be 
developed. Methods used today to calculate cosmological quan- 
tities tend to be of complexity 0(n 2 ) (where n is the number of 
data points), which is not computationally feasible with the bil- 
lions of measurements expected from future surveys, even with 
the expected improvements in computer hardware. 

Many cosmological calculations require independent calcu- 
lations of the same quantity for all data points, which makes 
them ideal candidates for parallelization (where each calculation 
is farmed out to a different processing device). In the past this has 
often been handled by using many CPUs in a computing cluster 
environment, but building these clusters or buying time on these 
systems can be expensive for researchers. GPU computing has 
brought a significant amount of this computational power to the 
desktop and is therefore more affordable for individual analysts. 
As GPU computing develops and becomes more widely used by 
the broader astronomy community, we foresee performing cal- 
culations in the near future that are currently computing-limited. 
In this paper, we describe the GPU implementation of two of 
these cosmological calculations, the two-point angular correla- 
tion function and aperture mass statistic, using the CUDA pro- 
gramming language (Ni ckolls et al. (2008)| >. 

Large-scale structure in the universe is a valuable probe of 
the composition and evolution of matter in the universe, and can 
be used to constrain models of cosmology. Galaxies are good 
tracers of the total matter in the universe; although we can only 
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directly detect the luminous matter, galaxies form around con- 
centrations of dark matter. We can therefore characterize large 
scale structure of matter in the universe using the clustering of 
galaxies on different length scales, which is measured using the 
angular correlation function. The two-point angular correlation 
function (or matter power spectrum in Fourier space) is based 
on galaxy number counts, and measures the excess or depletion 
of pairs of galaxies as a function of separation, compared to to 
a random distribution. At small scales (« lOO/i -1 M pc) we can 
measure the imprint of the baryon acoustic oscillations, which 
gives information on the phases of acoustic waves at recombi- 
nation. At larger scales the matter power spectrum has not been 
affected by radiation or baryons, so it is a rare probe of primor- 
dial fluctuations and inflation. See |Bassett and Hlozek (2009)| 
for a full review of the subject. Calculating a correlation func- 
tion over billions of galaxies, and at separations ranging from 
arcseconds to degrees, requires significant computational power, 
which scales with the square of the number of galaxies. As such 
it is an excellent candidate for implementation on the GPU. 
As we were preparing this paper, we became aware of work 
by |Ponce et al. (2012)| presenting a method and code for the cal- 
culation of the two-point correlation function on the GPU using 
CUDA. We take a different approach to the implementation of 
the algorithm. 

Dark matter cannot yet be detected directly, but the effect of 
its gravitational field can be measured indirectly using gravita- 
tional lensing. When light travels through the universe, its path 
is deflected by the gravitational potential of the matter it passes. 
The distortion of the observed shapes of distant galaxies as their 
light passes through matter in the universe encodes information 
about large-scale structure and the growth of matter in the uni- 
verse. However, since galaxies have intrinsic shapes, it is only 
through statistical analysis of large numbers of galaxies that we 
can average out their intrinsic shapes and orientations and ex- 
tract the cosmological information. There are many ways to in- 
terpret this information (see Bar telmann and Schneider (200 1)| 
for a review), but for the purposes of this paper we concen- 
trate on the shear peak statistic. Background galaxies tend to 
have the major axis tangential to foreground mass density con- 
tours, so concentrations of matter along the line of sight can 
be detected from the coherent distortions in the ensemble of 
their shapes. Over-densities will appear as peaks in a map of 
aperture mass. Counting the number of peaks as a function of 
peak significance allows us to distinguish between different cos- 
mologies (|Dietrich and Hartlap (2010)||Kratochvil et al. (2010)| 



|Maturi et al. (2010)| l. We calculate the aperture mass at a point 
by making a weighted sum over the tangential components of 
the ellipticity of the surrounding galaxies. Typically this sum 
contains tens or hundreds of thousands of galaxies, and must be 
performed on a dense grid of points over the sky in order to ac- 
curately reconstruct the projected mass field. The sum for each 
point is independent, and can be performed in parallel making it 
another good candidate for implementation on the GPU. Recent 
work by [Leonard et al. (2012)| has used wavelet transformations 
to speed the calculation of the aperture mass, but as far as we are 
aware this is the first implementation on the GPU. 

This paper is structured as follows. In Section|2]we introduce 
the two-point angular correlation function and describe its im- 
plementation on the GPU. We then compare its performance to 
the same calculation on the CPU using data taken from dark mat- 
ter simulations. In Section|3]we describe the aperture mass cal- 
culation and how it is implemented on the GPU and compare its 
performance to the CPU implementation. In Section [4] we sum- 
marize our findings. Access to the code is given in Appendix ICl 



2. Two-point angular correlation function 

The combined forces of gravitational attraction and dark en- 
ergy influenced the clumping of dark matter, and accompany- 
ing galaxy distributions, that we observe today. To quantify 
this clumping or clustering, one can make use of the angular 
correlation function w{8), which relates the probability 5P of 
finding two galaxies at an angular separation 8 on the observ- 
able sky to the probability for a random distribution of galaxies 
( [Peebles (1980)) : 



SP = N 2 [\ + w{ff)]da x dQ. 2 , 



(1) 



where dQ.\ and dQ.2 are elements of solid angle and N is the 
mean surface density of objects. The full calculation for the 
angular separation for any two objects in the sky is shown in 
Appendix lAl 

The estimator for the angular correlation function w(8) is cal- 
culated from the positions of galaxies in the sky. Three quantities 
go into this calculation: DD (data-data), RR (random-random), 
and DR (data-random). For an angular separation 8, DD is the 
number of pairs of galaxies in data separated by 8. A random dis- 
tribution of galaxies is generated over the same region of the sky 
as the data from which DD is calculated, and this random sam- 
ple is used to calculate RR in the same fashion as DD. DR is the 
cross-correlation between these two datasets, i.e., the number of 
galaxies in the random distribution that are an angular distance 
8 from the galaxies in the data sample. While there are different 
estimators for w(8), we have chosen to work with a widely ac- 
cepted estimator from |Landy & Szalay (1993) though we show 
that the code presented in this paper allows the user to easily ex- 
periment with other estimators. The estimator w(8) is calculated 
as 



w(8) = 



DD - 2DR + RR 



RR 



(2) 



To calculate DD, the analyst must calculate the angular sep- 
aration from every galaxy to every other galaxy, taking care to 
not calculate the separation of a galaxy from itself or to double 
count by calculating the distance from galaxy, to galaxy j and 
then from galaxy ; - to galaxy,. Given a catalog of n galaxies, there 
are n(n - l)/2 calculations. There is an analogous number of cal- 
culations for n r randomly distributed galaxies in the calculation 
of RR. To calculate the DR cross-correlation term, there are nn r 
calculations as every observed galaxy must be correlated with 
every simulated galaxy. 

The full calculation of DD, RR, and DR is effectively an 
0(n 2 ) operation. While there are techniques that speed up the 
calculation for CPUs by making approximations for the density 
of galaxies at larger angular separations ( |Jarvis et al. (2004)) , we 
show here how the exact calculation can be implemented on the 
GPU. 

2.1. GPU Implementation 

For the GPU implementation, we make use of the CUDA pro- 
gramming language (Nickoll s et al. (20 08)), a widely-used lan- 
guage developed by NVIDIA for use with their GPUs. In CUDA 
terminology, threads are the individual calculations performed 
in parallel by the GPU, and a kernel is the code that defines the 
calculation and specifies the number of threads that will run this 
calculation. To calculate the angular separations that go into DD, 
we have noted that there are n(n - l)/2 calculations when dou- 
ble counting and self-separations are avoided. An example ma- 
trix of these calculations is shown in Fig. 1, where each element 
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represents the calculation of distance between two galaxies. To 
calculate DD, we histogram these values, using some predeter- 
mined angular range for the binning. Histogramming on GPUs 
is nontrivial and NVIDIA provides examples of how this can be 
done (|Podlozhnyuk, V. (2007)[>, but we found these difficult to 



adapt to our routines. Therefore we implemented our own so- 
lution, which, while slower than the highly optimized NVIDIA 
code, is very flexible and adaptable. 

The amount of memory available to the GPU is usually less 
than that available to the CPU, but for these calculations this is 
not a limiting factor if GPU global memory is used for the initial 
transfer to the CPU, and GPU local memory is used only for 
the histogramming. Execution of GPU code can be dramatically 
slowed when multiple threads access the same memory address. 
For this reason, we adopt a very trivial algorithm for breaking 
down the full calculation into smaller packets, and at the same 
time using atomic operations for the histogramming. 

The coordinates for all the galaxies are copied to the GPU in 
global memory. The matrix of calculations is broken into square 
submatrices as shown in Fig. 1. We loop over each of these sub- 
matrices, each time launching the same kernel with the matrix 
indices that define the location of that chunk in the larger matrix 
of calculations. Each column of these calculations is calculated 
by a different thread. This means that some threads have fewer 
calculations to perform than others, but we have not attempted 
to try to optimize this performance. 

Shared memory is allocated for each block for the histogram- 
ming. The histogram is simply an array of integers used to hold 
the number of galaxies with a particular angular separation. We 
increment the entries of the histogram using the atomicAddO 
operation in CUDA. There could still be serialization occurring 
if multiple threads calculate a similar angular separation, requir- 
ing the results to be put into the same bin and therefore accessing 
the same memory address, but we have not found that this causes 
a significant slow down. After all the distances are calculated and 
recorded in the shared memory histogram they are summed over 
all the blocks and copied over to global memory. At this point, 
the kernel returns to the CPU, where the histogram is copied to 
the GPU memory and added to a histogram in the CPU memory. 
The full calculation is complete when we have walked through 
all of the submatrices. A summary of the process to calculate 
DD, RR, or DR can be stated as follows. 

1. Copy vectors of galaxy positions to the 
GPU global memory. 

2. Determine size and position of submatrices 
of the calculation. 

3. Launch kernel for each submatrix: 

Allocate local memory for histogramming 
on each block. Each thread calculates 
the angular separation for a column 
of the submatrix. 

4. Sum the local memory histogram arrays in 
global memory. 

5 Copy the histogram arrays back to the CPU 

and sum them. 
6. Continue to loop over all the submatrices 

until the entire calculation is done. 

The code allows the user to both set the low and high range 
limits of the histogram and choose one of three preset bin- 
widths: evenly spaced bins, logarithmic binning, and logarith- 
mic base-10 binning. Logarithmic binning allows finer bins to 
be used at small angular separations, where we have the most 



i th galaxy 



i th galaxy 



Fig. 1. Graphical representation of the matrix of angular dis- 
tances calculated for DD or RR. Each dot represents a single an- 
gular distance between a pair of galaxies. The number of galax- 
ies, n, is 40 in this figure. The left-hand plot shows the neces- 
sary calculations when self-distances and double counting are 
accounted for. The right-hand plot shows how the calculation is 
broken down on the GPU. Each smaller triangle of calculations 
is its own kernel launch. An individual thread will handle the 
calculations in one column of a submatrix. The filled black cir- 
cles or open red circles (color available online) are only to guide 
the reader's eye. 



i"' galaxy 



i"' galaxy 



Fig. 2. Graphical representation of the matrix of angular dis- 
tances calculated for DR. Each dot represents a single angular 
distance between a pair of galaxies. The number of galaxies, n, is 
40 in this picture. The right-hand plot shows how the calculation 
is broken down on the GPU. Each smaller square of calculations 
is its own kernel launch. An individual thread will handle the 
calculations in one column of a submatrix. The filled black cir- 
cles or open red circles (color available online) are only to guide 
the reader's eye. 



data. We have found that other groups will use different binning 
when performing their calculations and so we have built in this 
flexibility to make it easier for direct comparisons. The time it 
takes to process a dataset can depend on these parameters due 
to the algorithm we have chosen to implement. Specifically, if 
we select a wide range over which to histogram the distances, 
the time may be slightly longer because more calculated dis- 
tances need to be dropped into the histogram. If the binning is 
very coarse, the time may also increase because of serialization 
issues in memory access. 

We note that for the DR calculation, there is no double count- 
ing or self-distances to worry about. In this case there are about 
twice as many calculations to be done, but the procedure for 
breaking up the matrix into submatrices is still the same. We 
show a visualization of this process in Fig. 2. 
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2.2. Performance 

We test the consistency of the calculations and measure the in- 
crease in speed by comparing our GPU implementation to a 
straightforward implementation on a CPU using the C program- 
ming language. The details of both the GPU and CPU computing 
hardware and environment we used are given in Appendix iBl 

The dataset we used is a subset of a dark matter N-body 
simulation that is subsequently "decorated" with galaxies such 
that the galaxies realistically trace the underlying dark mat- 
ter distribution ( W echsler et al. in prep| Busha et al. in prep i. We 
use galaxies from the simulation with a red shift of z < 0.1. 
The dataset is read in from a text file that has two columns: 
the right ascension (RA) and declination (DEC) in arc minutes 
for each galaxy. The angular range covered in this dataset is 
0° < RA < 90° and 0° < DEC < 90° and we generate a flat 
distribution of galaxies over the same range. We use this dataset 
to demonstrate the speed of the full calculation for different num- 
bers of galaxies. 

The first step is to show that the GPU implementation gives 
the same values as those calculated with the CPU. We compare 
the two implementations using 10 5 galaxies for the DR com- 
ponent, which requires « 2 = (10 5 ) 2 = 10 10 calculations. We 
use 254 bins and logarithmic binning with a wide range of his- 
togram bins. There are 172 non-empty bins and only 30 of them 
have exactly the same numbers when the CPU or GPU are used. 
However, closer inspection of bins with different numbers of en- 
tries shows they are almost exactly the same. The fractional dif- 
ference between the bins is always between 10~ 3 - 10~ 8 , with the 
smaller fractional differences found in bins with a large number 
of entries. We attribute this discrepancy to either differences in 
floating point implementation or trigonometric and logarithmic 
implementations on the CPU and GPU. 

For timing comparisons using the CPU and GPU, we choose 
the DR calculation for 10 4 , 10 5 , and 10 6 galaxies. We use 254 
bins and logarithmic binning. We note that to calculate either DD 
or RR each require about one-half the number of calculations as 
DR and so the full calculation for w{9) should take about twice 
the amount of time shown here, given only one CPU or GPU. 
The times are calculated using the UNIX/Linux command time, 
which gives the total time required to execute any command and 
return control to the user. While this gives less fine-grained in- 
formation about where the programs spend their time, we choose 
this method as it gives the best sense of how much time it costs 
the analyst to actually make these calculations. While both the 
CPU and GPU implementation spend most of their time actually 
calculating the angular separations, this is a slightly smaller per- 
centage of the total time when using the GPU. Therefore, when 
testing with small numbers of galaxies, the GPU implementation 
does not quite increase as n 2 , . . The results are summarized in 

~ galaxies 

TableQ] For 10 5 galaxies, we note that there is a 140x increase in 
speed when using the GPU and for 10 6 galaxies a 180x increase 
in speed. 

For completeness we ran a variety of DR calculations with 
the 10 5 galaxy sample, where we vary the number of bins, the 
range of bins, and whether or not we use equal bin widths or 
logarithmic binning (natural and base 10). As we mentioned be- 
fore, if too many threads are accessing the same histogram bin, 
then the calculation can take longer. In addition, if the range of 
the bins is less than the range of the data, time can be saved 
by ignoring calculations that would fall outside the bin range. 
Running over a variety of these combinations, we found that the 
DR calculation for 10 5 galaxies can take between 1 1 and 57 sec- 



Table 1. Speed of angular correlation calculations when per- 
formed on either the CPU or GPU, for different numbers of 
galaxies. The times are for the DR component; there are n 2 
calculations performed, where n is the number of galaxies. All 
times are in seconds. These tests used a histogram with 254 bins 
and logarithmic binning. Compilation details can be found in 
Appendix|B] 





10 4 galaxies 


10 3 galaxies 


10° galaxies 


CPU 


23.9s 


2305s 


231,225 s 


GPU 


0.2s 


16.0s 


1,254s 



onds. Therefore users should be aware that depending on their 
choice of binning, timing results may vary. 

3. Aperture Mass 

As described in Section [1] light from distant galaxies is lensed 
by structures along the line-of-sight, and therefore the statistical 
ensemble of observed galaxy shapes encodes information about 
the matter power spectrum of the universe. This information can 
be used to constrain models of cosmology, which predict differ- 
ing amounts of structure on different mass and distance scales, 
and in different epochs. 

The tidal gravitational field of matter along the line-of-sight 
causes the shear field to be tangentially aligned around projected 
mass-density peaks. This alignment can be used to detect matter 
over-densities by constructing the aperture mass (M ap ) described 
in |Schneider (1996)| which is the weighted sum of the tangen- 
tial components of ellipticity of the galaxies (e,) surrounding a 
position in the sky (Oq): 



M ap (8 ) = — V Q(8de it , 



(3) 



If the filter function Q has a shape that follows the expected shear 
profile of a mass peak, then the aperture mass is a matched filter 
for detecting these peaks. Q can have a generic (e.g. Gaussian) 
shape, or be optimized for detections of halos with an NFW den- 
sity profile ( |Navarro et al. (1997)||Schirmer et al. (2007)) : 



Qnfw(x) = 



1 



tanh(x/0.15) 



1 



,6-150* + e -47 + 50x x / .15 



(4) 



where x = 6il6, mx , and 6 max defines the radius of the filter. 
See |Maturi et al. (2010)| for a detailed study of different filter 
shapes and their impact on cosmological constraints. The filter 
radius is chosen to maximize the sensitivity to cosmology, and 
typically several different filter sizes are combined to obtain in- 
formation on both large-scale and smaller-scale structures in the 
matter density of the universe. We look for peaks in the map of 
signal-to-noise ratio (SNR), where the noise can also be calcu- 
lated directly from the data: 



SNR(0 o ) 



(5) 



The easiest way to implement this algorithm on the CPU is 
with compiled C code, using nested loops over vectors contain- 
ing the measured galaxy parameters. We can reduce the time 
required for calculation by considering only galaxies that have a 
significant contribution to the aperture mass. 
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The complexity of this algorithm is oc 0(n pts * n ga i s ), where 
ripts is the number of points in space for which we wish to eval- 
uate the aperture mass (e.g. a grid of 512xx512 points), and 
rig a i s is the number of galaxies that significantly contribute to 
the aperture mass. Since we ignore galaxies far from the recon- 
struction point where the value of the filter function goes to zero, 
n ga h is proportional to the filter radius squared, 9} nax , as given in 
Equation|4] 

This becomes a computational problem with a large dataset, 
since we must make a nontrivial calculation of aperture mass 
for all surrounding galaxies for each point on the sky. For small 
area or shallow surveys this is a feasible calculation on a desktop 
machine, provided we use a low-density grid of reconstruction 
points. Larger, deeper surveys, which contain a higher galaxy 
density and a larger area over which to calculate the aperture 
mass, become problematic. 



3.1. GPU Implementation 

The implementation on the GPU is straightforward, and sim- 
ply parallelizes the brute-force method described above, us- 
ing CUDA. First, the vectors of galaxy parameters are copied 
into global memory on the GPU. The kernel is then launched, 
where each thread calculates the contributions of the surround- 
ing galaxies for one point on the reconstruction grid, summing 
the contributions and returning the resulting aperture mass and 
SNR. The number of threads required is equal to the number 
of points in the reconstruction grid. The process can be broken 
down as follows: 

1. Copy vectors of galaxy positions and shear 
components to the GPU global memory. 

2 . Launch kernel : 

Each thread calculates the aperture mass 
for one point on the grid, looping over 
the surrounding galaxies and summing the 
contributions to the aperture mass. 

3. Copy vector of aperture mass from GPU 
global memory. 

The complexity of this algorithm is 0(n go i s ), or 0(8f nax ), 
since we parallelize the calculations over the grid of reconstruc- 
tion points. 

GPU units designed for scientific computing have a large on- 
chip memory. We use a Tesla M2070 GPU card with Fermi ar- 
chitecture and 5.25GB on-chip global memory (with ECC on). 
See AppendixlBlfor more details on the system architecture. This 
is sufficient memory to store the positions and shear components 
for tens of millions of galaxies. The more important limit in the 
amount of data we can process in a single kernel launched on 
the GPU is the number of threads that can be launched simul- 
taneously. This will eventually limit the possible density of the 
reconstructed aperture mass map. By splitting the sky into over- 
lapping segments and calculating the aperture mass for each seg- 
ment sequentially we can easily circumvent this issue. However, 
for the survey size and reconstruction grid density we consider in 
this work, it has not been necessary to do this. All performance 
numbers in Section [3~2l are given for all data processed in a sin- 
gle kernel launch with no segmentation of the dataset required. 

3.2. Performance 

We compare the performance of the CPU and GPU algorithms to 
analyze realistic shear data. We create input catalogs from sim- 




25 30 
Filler Radius (arcmin) 

Fig. 3. Calculation time for CPU code (top, black) and GPU code 
(bottom, red) for increasing filter size. The solid lines correspond 
to a reconstruction grid of 512x512 points, dashed lines are 
for 1024x1024 points and dotted lines for a grid of 2048x2048 
points. 



ulated shear fields created from a ray-traced N-body dark mat- 
ter simulation from |Kratochvil et al. (2012)| 



Yang etal. (2011) 



d. We use a galaxy 



with model galaxies as tracers of the shear fiel 
density of 35 galaxies per arcmin 2 (which is roughly the galaxy 
density expected to be used in LSST weak lensing analyses 
( |Wittman et al. (2009)| >. There are a total of 10 6 galaxies in our 
dataset, which represents an area of sky of ~ 12 deg 2 . Using 
more or less galaxies does not impact the computational time, 
only the memory usage and the time required for transferring 
data to the GPU, so we compare compute times for different fil- 
ter sizes and different densities of the reconstruction grid. Table|2] 
shows the speed for the CPU and GPU implementation of the 
aperture mass code in these different cases. As for the timing 
tests for the angular correlation function, we include in our GPU 
timing the time required for all data to be copied to and from the 
GPU global memory. This comprises ~ 3 seconds and is constant 
for all filter and grid sizes. The aperture mass maps obtained 
from both methods are essentially identical. 



Table 2. Speed of calculation for CPU and GPU code for in- 
creasing density of reconstruction grid, and for increasing filter 
radius. The GPU code gives calculation times 100 - 300 times 
faster. 





2' 


4' 


8' 


16' 


32' 






CPU 








512x512 grid 


558s 


781s 


1,478s 


4,108s 


12,794s 


1024x1024 grid 


2,105s 


2,627s 


5,204s 


15,311s 


48,474s 


2048x2048 grid 


8,855s 


11,941s 


22,546s 


61,406s 


189,861s 






GPU 








512x512 grid 


lis 


12s 


15s 


23s 


45s 


1024x1024 grid 


38s 


41s 


50s 


78s 


162s 


2048x2048 grid 


143s 


156s 


190s 


297s 


627s 



Using the GPU code decreases the calculation time by lOOx- 
300x compared to the calculation on the CPU. Recent work 
by |Leonard et al. (2012)| has produced a very fast method of 
calculating the aperture mass using a wavelet transformation. 
They achieved speed-up factors from 5x-1200x, depending on 
filter radius, for an image of 1024x1024 pixels. We take a 
very different approach to the aperture mass calculation, but 
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we find similar improvements in speed, with the advantage 
that our approach calculates the noise directly from the data 
(as outlined in Bartelmann and Schneider (2001)| l, whereas the 
|Leonard et al. (2012) technique requires randomization of the 
data and repeated measurements to obtain the uncertainty. Our 
approach saves considerable computing time. The timings given 
in this section include the time necessary for calculation of the 
noise. 



We compile our code using CUDA version 4.10 (both driver 
and runtime) and gcc version 4.1.2. The operating system is 
Scientific Linux, version SL5, using kernel 2.6.18. 

For the timing comparisons, we write the CPU code using 
C/C++ and compile using the gcc compiler with the optimiza- 
tion flags -01. This flag gives about a 30% improvement over 
not using this flag. Greater degrees of optimization (-02 , -03) 
did not give any additional increase in speed. 



4. Summary 

We have implemented code to perform calculations of the two- 
point angular correlation function and the aperture mass statis- 
tic on the GPU. We have demonstrated that this implementa- 
tion can reduce compute times for these calculations by fac- 
tors of 100x-300x, depending on the amount of data to be pro- 
cessed. The code for making this calculation is publicly available 
fromGithub (see AppendixIClfor details). Faster compute speeds 
mean that a full MC -based calculation of the errors for the an- 
gular correlation function can reasonably be performed, with- 
out approximations or assumptions that are required to make the 
calculation reasonable for CPU codes (e.g. kd-tree calculation 
Parvis et al. (2004)) ). We intend to evaluate this in future work. 

The increasing size of astronomical datasets will require a 
new approach to data analysis. We expect that the use of the 
GPU in everyday cosmological calculations will become more 
common in the next few years, especially since faster compute 
times allows experimentation in techniques used to make the cal- 
culation and rapid comparison of the results. We expect this ap- 
plication to be extended to other computationally challenging 
calculations, such as the three-point and higher order angular 
correlation functions, and the shear correlation functions. 
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Appendix A: Angular separation calculation 

The angular distance between two galaxies can be calculated us- 
ing: 



Aa = oci — ct\ 
A = cos 2 5z sin 2 Aa 
B = cos 6\ sin £2 - sin^i cos 62 cos Aa 
C = sin 61 sin £2 + cos^i cos 62 cos Aa 
180° 



( tJaTW) 

6 = arctan 

n C 



(A.l) 
(A.2) 
(A.3) 
(A.4) 

(A.5) 



Appendix C: Code repository 

Our code is publicly available on Github, a software hosting ser- 
vice that uses git for version control. The repository can be 
found at https://github.com/djbard/ccogs And can be 
cloned by anyone who has git installed on their system. 

Along with the code, we have provided sample datasets and 
scripts to run and test your installation. Each package has its own 
README that details how to build and run the software. Problems 
or improvements can be directed to the authors. 

This software is licensed under the MIT License. 
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where 6 is the angular separation between any two galaxies in 
degrees and (a, , <5,) is the right ascension and declination of the 
/ th galaxy. 



Appendix B: Computing environment 

The GPU implementations for both algorithms were tested on a 
Tesla M2070. This device has 5375 MB of global memory (with 
ECC turned on) and 49152 bytes of shared memory. The clock 
speed for the GPU processors is 1.15 GHz. 

For tests using the CPU, we run on an Intel Xeon E5540 
processor with a 2.53 GHz clock speed and 3 GB of on-board 
memory. 
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